This tutorial will guide you through implementing a recurrent neural network that classifies IMDB movie reviews as positive or negative.
The IMDB dataset consists of 25,000 reviews, each with a binary label (1 = positive, 0 = negative). Here is an example review:
“Okay, sorry, but I loved this movie. I just love the whole 80’s genre of these kind of movies, because you don’t see many like this...” -~CupidGrl~
The dataset contains a large vocabulary, and reviews vary in length from tens to hundreds of words. We reduce the complexity of the dataset in two steps:
- Limit the vocabulary to vocab_size = 20000 words by replacing the less frequent words with an Out-of-Vocab (OOV) token.
- Truncate or pad each review to max_len = 128 words.
We have already done this preprocessing and saved the data in a pickle file: imdb_data.pkl.
The needed file can be downloaded from https://s3-us-west-1.amazonaws.com/nervana-course/imdb_data.pkl and placed in the data directory.
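The preprocessing described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the script that produced imdb_data.pkl; in particular, the OOV index, the padding index, and the word-to-index dictionary here are assumptions made for the example:

```python
import numpy as np

def preprocess(reviews, word_to_index, vocab_size=20000, max_len=128,
               oov_idx=1, pad_idx=0):
    """Map tokenized reviews to fixed-length index arrays.

    Words that are unknown, or whose index falls outside the vocab_size
    most frequent words, become the OOV token; reviews longer than max_len
    are truncated, and shorter ones are padded (oov_idx and pad_idx are
    illustrative conventions, not necessarily those used in imdb_data.pkl).
    """
    X = np.full((len(reviews), max_len), pad_idx, dtype=np.int32)
    for i, words in enumerate(reviews):
        idxs = [word_to_index.get(w, oov_idx) for w in words]
        idxs = [ix if ix < vocab_size else oov_idx for ix in idxs]
        X[i, :min(len(idxs), max_len)] = idxs[:max_len]
    return X

# Toy vocabulary: 'rareword' sits beyond the 20000-word cutoff.
vocab = {'love': 2, 'movie': 3, 'rareword': 25000}
X = preprocess([['love', 'movie', 'rareword', 'unseen']], vocab, max_len=8)
```

Both the rare word and the unseen word collapse to the OOV token, and the short review is padded out to max_len.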
In [ ]:
import pickle as pkl
data = pkl.load(open('data/imdb_data.pkl', 'rb'))  # open in binary mode for pickle
The data dictionary contains four numpy arrays for the data:
- data['X_train'] is an array with shape (20009, 128): 20009 training reviews, each with up to 128 words.
- data['Y_train'] is an array with shape (20009, 1): a target label (positive=1, negative=0) for each review.
- data['X_valid'] is an array with shape (4991, 128) for the 4991 examples in the validation set.
- data['Y_valid'] is an array with shape (4991, 1) with the labels for the validation set.
In [ ]:
print(data['X_train'].shape)
In [ ]:
from neon.backends import gen_backend
be = gen_backend(backend='gpu', batch_size=128)  # use backend='cpu' if no compatible GPU is available
To train the model, we use neon's ArrayIterator object, which iterates over these numpy arrays and returns a minibatch of data with each call, ready to pass to the model.
In [ ]:
from neon.data import ArrayIterator
import numpy as np
data['Y_train'] = np.array(data['Y_train'], dtype=np.int32)
data['Y_valid'] = np.array(data['Y_valid'], dtype=np.int32)
train_set = ArrayIterator(data['X_train'], data['Y_train'], nclass=2)
valid_set = ArrayIterator(data['X_valid'], data['Y_valid'], nclass=2)
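Conceptually, ArrayIterator does something like the following pure-NumPy loop. This is a simplified sketch: neon's version also moves data onto the backend and one-hot encodes the labels, and how partial final batches are handled here (dropped) is an assumption of the sketch:

```python
import numpy as np

def minibatches(X, y, batch_size=128):
    """Yield consecutive (X, y) minibatches from in-memory arrays.

    The last partial batch is dropped for simplicity (an assumption of
    this sketch, not necessarily neon's behavior).
    """
    for start in range(0, X.shape[0] - batch_size + 1, batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

# Tiny example: 10 rows split into minibatches of 4.
X = np.arange(20).reshape(10, 2)
y = np.arange(10).reshape(10, 1)
batches = list(minibatches(X, y, batch_size=4))
```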
In [ ]:
from neon.initializers import Uniform, GlorotUniform
init_glorot = GlorotUniform()
init_uniform = Uniform(-0.1/128, 0.1/128)
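For reference, Glorot (Xavier) uniform initialization draws weights from U(-l, l) with l = sqrt(6 / (fan_in + fan_out)), which keeps activation variance roughly constant across layers. A quick sketch (the fan sizes below are illustrative):

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=None):
    """Sample a (fan_in, fan_out) weight matrix from the Glorot uniform range."""
    rng = rng or np.random.default_rng(0)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# For the 128x128 LSTM weights below, the range works out to about +/-0.153.
W = glorot_uniform(128, 128)
```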
The network consists of a sequential list of the following layers:
- LookupTable is a word embedding that maps from a sparse one-hot representation to dense word vectors. The embedding is learned from the data.
- LSTM is a recurrent layer with "long short-term memory" units. LSTM networks are good at learning temporal dependencies during training, and often perform better than standard RNN layers.
- RecurrentSum is a recurrent output layer that collapses over the time dimension of the LSTM by summing outputs from individual steps.
- Dropout performs regularization by silencing a random subset of the units during training.
- Affine is a fully connected layer for the binary classification of the outputs.
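The first and third of these layers are easy to emulate in NumPy: LookupTable is a row lookup into an embedding matrix, and RecurrentSum collapses the time axis by summation. A toy sketch (the sizes and indices are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(20000, 128))   # embedding table: vocab_size x embedding_dim
review = np.array([5, 42, 7, 0])      # one review as word indices (toy example)

vectors = emb[review]                 # LookupTable: rows of emb, shape (time, embedding_dim)
summed = vectors.sum(axis=0)          # RecurrentSum: collapse the time dimension
```

In the real model the LSTM sits between these two steps, so what gets summed are the LSTM's per-timestep outputs rather than the raw embeddings.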
In [ ]:
from neon.layers import LSTM, Affine, Dropout, LookupTable, RecurrentSum
from neon.transforms import Logistic, Tanh, Softmax
from neon.models import Model
layers = [
LookupTable(vocab_size=20000, embedding_dim=128, init=init_uniform),
LSTM(output_size=128, init=init_glorot, activation=Tanh(),
gate_activation=Logistic(), reset_cells=True),
RecurrentSum(),
Dropout(keep=0.5),
Affine(nout=2, init=init_glorot, bias=init_glorot, activation=Softmax())
]
# create model object
model = Model(layers=layers)
In [ ]:
from neon.optimizers import Adagrad
from neon.transforms import CrossEntropyMulti
from neon.layers import GeneralizedCost
cost = GeneralizedCost(costfunc=CrossEntropyMulti(usebits=True))
optimizer = Adagrad(learning_rate=0.01)
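Adagrad adapts each parameter's step size by dividing the learning rate by the square root of that parameter's accumulated squared gradients, so frequently-updated parameters take smaller steps. A minimal sketch of one update (the epsilon value is an assumption for numerical stability, not neon's exact default):

```python
import numpy as np

def adagrad_step(param, grad, accum, lr=0.01, eps=1e-8):
    """One Adagrad update: accumulate grad**2, then scale the step per parameter."""
    accum += grad ** 2
    param -= lr * grad / (np.sqrt(accum) + eps)
    return param, accum

w = np.array([1.0, 1.0])
state = np.zeros(2)  # running sum of squared gradients
w, state = adagrad_step(w, np.array([0.5, 2.0]), state)
```

On the very first step both parameters move by about lr regardless of gradient magnitude; the per-parameter scaling only diverges as gradient history accumulates.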
Callbacks allow the model to report its progress during the course of training. Here we tell neon to evaluate on the validation set and to serialize (save) the model every epoch.
In [ ]:
from neon.callbacks.callbacks import Callbacks
model_file = 'imdb_lstm.pkl'
callbacks = Callbacks(model, eval_set=valid_set, serialize=1, save_path=model_file)
In [ ]:
model.fit(train_set, optimizer=optimizer, num_epochs=2,
cost=cost, callbacks=callbacks)
In [ ]:
from neon.transforms import Accuracy
print("Test Accuracy - {}".format(100 * model.eval(valid_set, metric=Accuracy())))
print("Train Accuracy - {}".format(100 * model.eval(train_set, metric=Accuracy())))
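The Accuracy metric is just the fraction of reviews for which the argmax over the two softmax outputs matches the label. As a sketch (the toy probabilities and labels below are made up):

```python
import numpy as np

def accuracy(probs, labels):
    """Fraction of examples where the predicted class equals the label."""
    preds = probs.argmax(axis=1)           # predicted class per example
    return (preds == labels.ravel()).mean()

# Four examples: the model gets three of the four labels right.
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
labels = np.array([[0], [1], [1], [1]])
acc = accuracy(probs, labels)
```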